Data-dependent bucketing improves reference-free compression of sequencing reads
Abstract
MOTIVATION: The storage and transmission of high-throughput sequencing data consume significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reducing storage and transmission requirements is to compress the sequencing data.
RESULTS: We present a novel technique to boost the compression of sequencing reads, based on bucketing similar reads so that they appear near one another in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.
AVAILABILITY AND IMPLEMENTATION: Mince is written in C++11, is open source and is available under the GPLv3 license at http://www.cs.cmu.edu/~ckingsf/software/mince.
CONTACT: [email protected]
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
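The idea of bucketing similar reads so that they sit close together in the file can be illustrated with a small sketch. The abstract does not describe how Mince assigns reads to buckets, so everything below is an assumption made for illustration: the bucket label is the lexicographically smallest k-mer of each read (a minimizer), the function names `minimizer`, `bucket_reads` and `compressed_size` are hypothetical, and zlib stands in for whatever back-end compressor a real tool would use.

```python
import zlib
from collections import defaultdict


def minimizer(read: str, k: int = 4) -> str:
    """Return the lexicographically smallest k-mer of a read.

    Assumption: this is only an illustrative bucket label, not the
    data-dependent scheme used by Mince.
    """
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k: int = 4):
    """Group reads that share a minimizer and emit them bucket by bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    ordered = []
    for label in sorted(buckets):
        # Reads in one bucket share at least one k-mer, so they end up adjacent.
        ordered.extend(sorted(buckets[label]))
    return ordered


def compressed_size(reads) -> int:
    """Bytes needed for the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


if __name__ == "__main__":
    reads = ["ACGTTT", "GGGGCC", "TTACGT", "CCGGGG"]
    reordered = bucket_reads(reads)
    print(reordered)
    print(compressed_size(reads), compressed_size(reordered))
```

Reordering discards the original read order, which is acceptable only when that order carries no information. Whether the reordered file compresses better depends on the data, so the sketch prints both sizes rather than assuming a gain.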
Related articles
Reference-based compression of short-read sequences using path encoding
MOTIVATION Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. RESULTS We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and ...
Compression of high throughput sequencing data with probabilistic de Bruijn graph
Motivation: Data volumes generated by next-generation sequencing technologies are now a major concern, both for storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip. Most reference-free tools developed for NGS data compression still use general text compression methods and fail to benefit from algorithms...
Quark enables semi-reference-based compression of RNA-seq data
Motivation The past decade has seen an exponential increase in biological sequencing capacity, and there has been a simultaneous effort to help organize and archive some of the vast quantities of sequencing data that are being generated. Although these developments are tremendous from the perspective of maximizing the scientific utility of available data, they come with heavy costs. The storage...
Compression of short-read sequences using path encoding
Storing, transmitting, and archiving the amount of data produced by next generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty associated with managing the data produced by large-scale t...
Volume: 31
Publication date: 2015